arXiv:1508.01549v1 [cs.LG] 6 Aug 2015


U. KAMATH ET AL. 




Theoretical and Empirical Analysis of a Parallel Boosting Algorithm

Uday Kamath, Member, IEEE, Carlotta Domeniconi, Member, IEEE, and Kenneth De Jong, Life Fellow, IEEE


Abstract —Many real-world problems involve massive amounts of data. Under these circumstances learning algorithms often become 
prohibitively expensive, making scalability a pressing issue to be addressed. A common approach is to perform sampling to reduce the 
size of the dataset and enable efficient learning. Alternatively, one customizes learning algorithms to achieve scalability. In either case, 
the key challenge is to obtain algorithmic efficiency without compromising the quality of the results. In this paper we discuss a 
meta-learning algorithm (PSBML) which combines features of parallel algorithms with concepts from ensemble and boosting 
methodologies to achieve the desired scalability property. We present both theoretical and empirical analyses which show that PSBML 
preserves a critical property of boosting, specifically, convergence to a distribution centered around the margin. We then present 
additional empirical analyses showing that this meta-level algorithm provides a general and effective framework that can be used in 
combination with a variety of learning classifiers. We perform extensive experiments to investigate the tradeoff achieved between 
scalability and accuracy, and robustness to noise, on both synthetic and real-world data. These empirical results corroborate our 
theoretical analysis, and demonstrate the potential of PSBML in achieving scalability without sacrificing accuracy. 

Index Terms —Large Margin Classifier, Parallel Methods, Scalability, Machine Learning. 
1 Introduction

Many real-world applications, such as web mining, social
network analysis, and bioinformatics, involve massive
amounts of data. Under these circumstances many traditional
supervised learning algorithms become prohibitively
expensive, making scalability a pressing issue to be addressed.
For example, support vector machines (SVMs) have
training times of $O(n^3)$ and space complexity of $O(n^2)$,
where $n$ is the size of the training set [1].

In order to handle the "big data problem", one of two 
approaches is typically taken: (1) Sampling of the data to 
reduce its size, or (2) customization of the learning algorithm 
to improve the running time via parallelization. Sampling 
techniques often introduce unintended biases that reduce 
the accuracy of the results. Similar reductions in accuracy 
often result from modifications to a learning algorithm to 
improve its speed. The second approach also lacks generality, 
and requires customization per learning algorithm. As such, 
it is highly desirable to obtain a general framework that 
can enable scalable machine learning with the following 
properties: (1) to be applicable to a large variety of algorithms; 
and (2) to keep high accuracy while achieving the desired 
speed-up. This paper addresses this need, while focusing on 
classification. 

Recently, a parallel spatial boosting machine learner
(PSBML) was introduced [2], [3]. PSBML is a meta-learning
paradigm that leverages concepts from stochastic optimization
and ensemble learning to arrange a collection of classifiers
in a two-dimensional toroidal grid. The key novelty and
relevance of this framework is how scalability is achieved 
and the fact that it overcomes the major limitations of existing 
approaches to big data, i.e., sampling and customized parallel 
solutions. Furthermore, due to its distributed nature, the 
paradigm is especially effective with massive data. In this 
paper we provide both a theoretical and an empirical analysis 
of PSBML to investigate its behavior and properties. Our 
findings reveal that the significance of PSBML is two-fold. 
First, it provides a general framework for parallelization, 
and as such it can be used in combination with a variety of 
learners. Second, it is capable of achieving effective speed- 
ups without sacrificing accuracy. 

We use Gaussian mixture models (GMMs) combined 
with the mean-shift procedure to establish an analytical 
model of PSBML, and show that it converges to a data 
distribution whose modes are centered on the margin of 
the classification boundary. As such, the algorithm inherits 
the properties of good generalization and resilience to noise 
that are associated with large margin classifiers. We perform 
extensive experiments to evaluate the performance of PSBML 
with a variety of learners, to measure the speed and accuracy 
trade-offs achieved, and to test the effect of noise. All results 
confirm the strength of PSBML anticipated by our theoretical 
findings. 

This paper is a major and significant extension of our
previous work [2], [3]. In particular, the novel contributions
of this article are as follows.

• Formal proof showing the convergence property of
the PSBML algorithm and the fact that PSBML is a
large margin classifier.

• Extensive empirical experiments as detailed below.

    - Experiments using simulated data to verify the
    established theoretical findings and properties
    of PSBML. The results confirm the expected behavior
    as anticipated by the theory developed
    in this paper.

    - Extensive experiments on scalability comparing
    nine different classifiers custom optimized for
    speed, including different versions of support
    vector machines and parallel boosting.

    - Experiments on PSBML scalability as a function
    of data size and number of threads. Our
    findings show that training time scales linearly
    with data size and that it steadily improves with
    the number of threads.

    - Experiments on the PSBML memory requirement
    as a function of data size. Our findings show
    a linear increase of the mean peak working
    memory with the training data size.

    - Additional experiments on parameter sensitivity
    analysis, meta-learning, and noise impact,
    thus providing a comprehensive testing framework.

2 Related Work 

The PSBML approach builds upon stochastic optimization
techniques and ensemble learning. The specific stochastic
optimization technique used is spatially structured evolutionary
algorithms, which embed the data in a metric space that
constrains how samples may interact and how they are compared
and updated [4], [5]. In ensemble learning, multiple classifiers
are generated and combined to make a final prediction. It has
been shown that ensemble learning, through the consolidation
of different predictors, can lead to significant reductions
in generalization error [6]. Of particular relevance is the
AdaBoost technique and its variants, such as confidence-based
boosting [7]. AdaBoost induces a classification model
by estimating the hard-to-learn instances in the training
data [8]. A formal analysis of the AdaBoost technique has
derived theoretical bounds on the margin distribution to
which the approach converges [9].

In statistical learning theory, a formal relationship between
the notion of margin and the generalization classification
error has been established [10]. As a result, classifiers
that converge to a large margin perform well in terms of
generalization error. One of the most popular examples
of such classifiers is support vector machines (SVMs). The
classification boundary provided by an SVM has the largest
distance from the closest training point. SVMs have been
modified to scale to large data sets [11], [12], [13], [14], [15].
Many of these adaptations introduce a bias caused by the
approximation used, such as sampling the data or assuming a
linear model, which can lead to a loss in generalization while
trying to achieve speed.

To achieve scalable solutions with large data, algorithm-specific
customizations are performed to enable distributed
architectures and network computing [16], [17], [18], [19].
These modifications have been conducted on algorithms like
decision trees, rule induction, and boosting algorithms [20],
[21], [22], [23]. In most cases, the underlying algorithm needs
to be changed in order to achieve a parallel computation.
Use of MapReduce with machine learning is a notable
demonstration of this difficulty. MapReduce is a popular
method for divide-and-conquer-based parallelism and has
been used in conjunction with machine learning algorithms
to scale to large datasets [24], [TSl. Distributed machine
learning systems such as Mahout or Pegasus [25], [26] sit
on top of Hadoop, a common MapReduce implementation
[27]. Many of the traditional machine learning algorithms
either need significant per-algorithm customization or must
be approximated to fit into the MapReduce framework.

MapReduce-based algorithms have the additional disadvantage
of unnecessary data movement and inefficiencies
in iterative computation, which is a core part of most machine
learning algorithms [28]. Recently proposed alternatives,
including modifications of MapReduce (for example, [29],
[30], [31]), can improve this situation significantly for highly
iterative scenarios, reducing data movement by up to 1000
times in some cases. These methods typically have the same
disadvantage as MapReduce with regard to algorithmic
customization. Finally, the possible presence of heterogeneous
nodes in clusters or cloud-based systems (with machines
differing in terms of number of cores, RAM, and disk size)
presents a challenge to such techniques. It can be difficult to
achieve efficient utilization of heterogeneous node resources
in divide-and-conquer methods, resulting in either a poor
utilization of resources or poor performance [28], [32], [33].

We argue that what is needed is a generic framework
that can be deployed efficiently with a variety of machine
learning algorithms and that still makes efficient use of
heterogeneous networks of nodes. The logical spatial grid of
learners we discuss in this paper promises to achieve exactly this.

In this paper, to analyze the PSBML algorithm we make
use of Gaussian Mixture Models (GMMs) and the mean-shift
procedure. A GMM is a parametric probabilistic model
consisting of a linear combination of Gaussian distributions
with unknown parameters. Typically the parameter values
are estimated so that the resulting model is the one that best
fits the data [34].

Mean-shift is a local search algorithm whose aim is to find
the modes (i.e. local maxima) of a distribution. It achieves
this goal by performing kernel density estimation, and
iteratively locating the local maxima of the kernel mixture
as the zeros of the corresponding gradient function [35].
Convergence to local maxima is guaranteed from any starting
point. Furthermore, it has been shown that, when combined
with GMMs, mean-shift is equivalent to an expectation-maximization
(EM) algorithm [36]. The key advantage of
using the mean-shift algorithm for mode finding on a given 
density is two-fold: (1) the approach is deterministic and non- 
parametric, since it is based on kernel density estimation; 
and (2) it poses no a priori assumptions on the number of 
modes |^, [KJ. 

3 The PSBML Algorithm 

PSBML can be described as a meta-learning algorithm 
that arranges a collection of learners in a two-dimensional 
toroidal grid. It can use any classifier capable of producing 
confidence measures on predictions. Each learner works 
independently on a portion of the data, and shares its "results" 
only with its neighbor learners. Through this local interaction, 
the information discovered locally at each cell (i.e., learner)
gradually travels throughout the entire grid. Eventually, the
collaborative behavior of the learners enables the emergence
of the crucial information to solve the problem at hand. The
emerging behavior only requires local interaction among the
learners, thus enabling a high degree of parallelism.

Fig. 1. Two-dimensional grid with various neighborhood structures.

In the following, we describe the different phases of the
algorithm. The pseudo-code of PSBML is given in Algorithm 1.

3.1 Initialization 

Given a collection of labeled data, an independent fixed
validation set is created and the rest is used for training.
PSBML uses a wrap-around toroidal grid to
distribute the training data and the specified classifier to
each node in the grid. The training data is distributed
across all the nodes using stratified uniform sampling of
the class labels (Line 1 of Algorithm 1). The parameters
for grid configuration, i.e. width and height of the grid,
replacement probability, and maximum number of iterations,
are all included in GridParam.
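
The initialization step can be illustrated with a minimal Python sketch. It assumes that "stratified uniform sampling" means shuffling each class uniformly and dealing its instances round-robin over the nodes; the function name `distribute_stratified` and its signature are hypothetical and are not taken from the PSBML implementation.

```python
import random
from collections import defaultdict

def distribute_stratified(labels, width, height, seed=0):
    """Assign instance indices to width*height nodes with a similar class mix per node."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    n_nodes = width * height
    nodes = [[] for _ in range(n_nodes)]
    cursor = 0  # shared round-robin position keeps node sizes balanced across classes
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)          # uniform sampling within each class (assumption)
        for index in members:
            nodes[cursor % n_nodes].append(index)
            cursor += 1
    return nodes
```

Each entry of the returned list holds the indices of the training instances initially owned by one grid node, so node `k` of a `width` x `height` grid can be addressed as row `k // width`, column `k % width`.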

3.2 Node behavior at each epoch 

The algorithm runs for a pre-determined number of iterations
(or epochs) (Line 4). The behavior of a node at each epoch
can be divided into two phases: training and testing. During
training, a node performs a standard training process using
its local data (Line 5). For testing, a node's training data is
combined with the instances assigned to the neighboring
nodes (Lines 9 and 10). Each node in the grid interacts only
with its neighbors, based on commonly used neighborhood
structures as shown in Figure 1. Each classifier outputs a
confidence value for the prediction of each test instance,
which is then used for weighting the corresponding instance.
Every node updates its local training data for the successive
training epoch by probabilistically selecting instances based
on the assigned weights (Line 11).

The confidence values are used as a measure of how
difficult it is to classify a given instance, allowing a node
to select, during each iteration, the most difficult instances
from all its neighbors. Since each instance is a member of the
neighborhood of multiple nodes, an ensemble assessment of
difficulty is performed, similar to the boosting of the margin
in AdaBoost [9]. Specifically, in PSBML the confidence $cs_i$ of
an instance $i$ is set equal to the smallest confidence value
obtained from any node and for any class: $cs_i = \min_{n \in N_i} c_{ni}$,
where $N_i$ is a set of indices defined over the neighborhoods
to which instance $i$ belongs, and $c_{ni}$ is the confidence credited
to instance $i$ by the learner corresponding to neighborhood $n$.
These confidence values are then normalized through linear
re-scaling:

$$cs_i^{norm} = \frac{cs_i - cs_{min}}{cs_{max} - cs_{min}}$$

where $cs_{min}$ and $cs_{max}$ are the smallest and the largest
confidence values obtained across all the nodes, respectively. The
weight $w_i = (1 - cs_i^{norm})$ is assigned to instance $i$ to indicate
its relative degree of classification difficulty. The $w_i$s are used
to define a probability distribution over the set of instances
$i$, and used by a node to perform a stochastic sampling
technique (i.e. weighted sampling with replacement) to
update its local set of training instances. The net effect is
that the smaller the confidence credited to an instance $i$
is (i.e. the harder it is to learn instance $i$), the larger the
probability will be for instance $i$ to be selected. Instead of
deterministically replacing the whole training data at a node
with new instances, a replacement probability Pr is used.
The effect of changing its value is discussed and analyzed in
Section 6.1. Due to the weighted sampling procedure, and to
the constant training data size at each node, copies of highly
weighted instances will be generated, and low weighted
instances will be removed with high probability during each
epoch.
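
The weighting and resampling step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the helper names are hypothetical, and the replacement probability is interpreted here as "each local instance is independently replaced, with probability Pr, by a weighted draw from the neighborhood pool", which the text does not spell out.

```python
import random

def ensemble_confidence(conf_by_neighborhood):
    """cs_i: smallest confidence that any neighborhood's learner gave to instance i."""
    return min(conf_by_neighborhood)

def difficulty_weights(cs):
    """Linearly rescale confidences to [0, 1] and return w_i = 1 - cs_i^norm."""
    lo, hi = min(cs), max(cs)
    if hi == lo:                      # degenerate case: all confidences are equal
        return [1.0] * len(cs)
    return [1.0 - (c - lo) / (hi - lo) for c in cs]

def resample_node(local, pool, pool_weights, p_r, rng):
    """Keep the node size constant; replace each instance with probability p_r
    (assumed semantics of Pr) by a weighted draw, with replacement, from pool."""
    new_local = []
    for instance in local:
        if rng.random() < p_r:
            new_local.append(rng.choices(pool, weights=pool_weights, k=1)[0])
        else:
            new_local.append(instance)
    return new_local
```

With this reading, Pr = 0 leaves a node's data unchanged and Pr = 1 redraws the whole local set from the weighted neighborhood pool, so low-confidence (hard) instances are duplicated and high-confidence ones tend to disappear.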


3.3 Grid behavior at each epoch 

At each iteration, once all nodes have performed the local
training, testing, and re-weighting, and have generated a
new training dataset sampled from the previous epochs as
described above, a global assessment of the grid is performed
to track the "best" classifier throughout the entire iterative
process. The unique instances from all the nodes are collected
and used to train a new classifier (Lines 12 and 13). The
independent validation set created during initialization is
then used to test the classifier (Line 14). This procedure
resembles the "pocket algorithm" used in neural networks,
which has been shown to converge to the optimal solution [37].
The estimated best classifier is returned as output and used to
make predictions for unseen test instances (Line 18).
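
The pocket-style bookkeeping described in this subsection amounts to keeping the model with the lowest validation error seen so far. A minimal sketch, with a hypothetical helper name and a plain tuple layout chosen only for illustration:

```python
def track_best(epoch_results):
    """Return the (error, model, data) triple with the lowest validation error.

    epoch_results is an iterable of (validation_error, model, pruned_data)
    triples, one per epoch; ties keep the earliest epoch.
    """
    best = None
    for error, model, data in epoch_results:
        if best is None or error < best[0]:
            best = (error, model, data)
    return best
```

Because the comparison is strict, a later epoch replaces the stored model only when it improves the validation error, which is the behavior the pocket analogy suggests.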


3.4 Iterative process 

The weighted sampling process and the interaction of neighboring
nodes enable the hard instances to migrate throughout
the various nodes, due to the wrap-around nature of the grid.
The rate at which the instances migrate depends on the grid
structure, and more importantly on the neighborhood size
and shape. Thus, the grid topology of classifiers and the data
distribution across the nodes provide the parallel execution,
while the interaction between neighboring nodes and the
confidence-based instance selection give the ensemble and
boosting effects.
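
The wrap-around neighborhood lookup can be sketched as follows. The offsets are assumptions based on the usual cellular evolutionary algorithm conventions (L5 as the four axis neighbors plus the center, C9 as the full 3 x 3 block); the paper's Fig. 1 is the authoritative definition, and the names below are hypothetical.

```python
# (d_row, d_col) offsets relative to a node; the centre (0, 0) is included.
NEIGHBORHOODS = {
    "L5": [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)],
    "C9": [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)],
}

def neighbors(row, col, height, width, shape="C9"):
    """Grid coordinates in the neighborhood of (row, col) on a toroidal grid."""
    return [((row + dr) % height, (col + dc) % width)
            for dr, dc in NEIGHBORHOODS[shape]]
```

The modulo arithmetic is what makes the grid toroidal: a corner node has exactly as many neighbors as an interior node, so hard instances can migrate across the grid edges.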


4 Theoretical Analysis: PSBML is a Large Margin Classifier

We use Gaussian Mixture Models (GMMs) combined with 
the mean-shift algorithm to model the behavior of PSBML. 
Specifically, we formally show that PSBML, through the 
weighted sampling selection process, iteratively changes the 
data distribution, and converges to a distribution whose 
modes are centered around the margin, i.e. around the 
hardest points to classify. This is an important milestone,
as it shows that PSBML inherits the properties of good
generalization and resilience to noise that are associated
with large margin classifiers, and thus further strengthens the
promise that our proposed framework will be an efficient
and effective paradigm to perform scalable machine learning
with massive data.

Algorithm 1. PSBML(Train, Validation, GridParam)

 1: INITIALIZEGRID(Train, GridParam)    ▷ Distribute the instances over the nodes in grid
 2: currentMin ← 100
 3: Pr ← GridParam.pr    ▷ Probability of replacement
 4: for i ← 0 to GridParam.iter do
 5:     TRAINNODES(GridParam)    ▷ Train all nodes
 6:     TESTANDWEIGHNODES(GridParam)    ▷ Collect neighborhood data and assign weights
 7:     PrunedData ← { }
 8:     for j ← 0 to GridParam.nodes do
 9:         NeighborData ← COLLECTNEIGHBORDATA(j)
10:         NodeData ← NodeData ∪ NeighborData
11:         ReplaceData ← WEIGHSAMPLING(NodeData, Pr)
12:         PrunedData ← PrunedData ∪ UNIQUE(ReplaceData)    ▷ Unique keeps one copy of instances in set
13:     ValClassifier ← CREATENEW(GridParam.classifier)    ▷ New classifier for validation
14:     error ← VALIDATE(PrunedData, Validation, ValClassifier)    ▷ Use validation set to track model learning
15:     if error < currentMin then currentMin ← error    ▷ Lines 16 and 17 belong to this branch
16:         bestClassifier ← ValClassifier
17:         marginData ← PrunedData
18: return bestClassifier, marginData

Each grid node in the PSBML algorithm, along with 
its neighborhood structure, represents a sample of the 
whole dataset, where each point is weighted according to 
how difficult it is to be classified. In our analysis, we fit 
a Gaussian mixture model on the weighted points, and 
apply the mean-shift procedure to locate the modes of 
the resulting distribution. We show that, throughout the 
iterations of PSBML, as more data closer to the boundary 
are being selected, the data distribution will grow higher 
modes centered around the margin. These modes will be the 
ones visited by the mean-shift procedure, irrespective of the 
starting point. 

Since each node in the toroidal grid has the same behavior, 
they all fit a Gaussian mixture model on their respective 
neighborhood. By consolidating the micro-behavior of the 
mean-shift procedure at each node, we obtain an overall 
convergence to a distribution with peaks centered around 
the boundary. Our analysis below, and the empirical results
in Section 5, confirm this.

4.1 Distribution of a node at time t = 1

After the completion of the first iteration of PSBML, each 
classifier in the grid has been trained with its own data, 
and is tested on the instances of the neighbors, to which it 
assigns confidence values. A common approach to assess the 
confidence of a prediction for an instance is to measure its 
distance from the estimated decision boundary: the smaller 
the distance, the smaller the confidence will be. The resulting 
weight values drive the probability for a point to be selected 
for the successive iterations. Below we use a Gaussian 
mixture to model this process. 


Consider a Gaussian mixture density of $M$ components,
$p(x) = \sum_{m=1}^{M} p(m)\, p(x|m)$, where the $p(m)$ are the mixture
proportions such that $p(m) > 0$, $\forall m = 1, \ldots, M$, and
$\sum_{m=1}^{M} p(m) = 1$. Each mixture component is a Gaussian
distribution in $\mathbb{R}^D$, i.e. $x|m \sim \mathcal{N}(\mu_m, \Sigma_m)$, where
$\mu_m = E_{p(x|m)}[x]$ and $\Sigma_m = E_{p(x|m)}[(x - \mu_m)(x - \mu_m)^T]$ are the
mean and covariance matrix of the Gaussian component $m$.

Let us first consider a known result for the mean-shift
procedure applied to Gaussian mixture models to find the
modes of the distribution [35]. No closed-form solution exists
to this problem, so numerical iterative approaches have been
developed. In particular, the fixed-point iterative method
gives the following fixed-point solution [35]: $x^{(\tau+1)} = f(x^{(\tau)})$,
where

$$f(x) = \left( \sum_{m=1}^{M} p(m|x)\, \Sigma_m^{-1} \right)^{-1} \sum_{m=1}^{M} p(m|x)\, \Sigma_m^{-1} \mu_m \qquad (1)$$

Let us assume now that we model the sample data
assigned to a node and to its neighbors using a Gaussian
mixture distribution of $M$ components in $\mathbb{R}^D$. In our analysis,
we consider only the distribution of one class; the
argument stays the same for the other class due to the
symmetry with respect to the boundary. We need to embed
the weighted sampling process performed by PSBML in
our Gaussian mixture modeling. Let us assume the optimal
boundary between classes is known. Let $s \in \mathbb{R}^D$ be a point
on the boundary. We estimate the distance of a point $x$
from the boundary by considering its distance from $s$. At
each iteration of the PSBML algorithm, the weights bias
the sampling towards those points which are closer to the
boundary: the larger the weight of a point is, the larger is the
probability of being selected. To embed this mechanism in
the Gaussian mixture modeling, we set the component density
to be $p'(x|m) = w(x)\, p(x|m)$, where $w(x)$ is a Gaussian
weighting function centered at $s$:

$$w(x) = (2\pi)^{-D/2}\, |\Sigma_s|^{-1/2}\, e^{-\frac{1}{2}(x - s)^T \Sigma_s^{-1} (x - s)}$$

and

$$p(x|m) = (2\pi)^{-D/2}\, |\Sigma_m|^{-1/2}\, e^{-\frac{1}{2}(x - \mu_m)^T \Sigma_m^{-1} (x - \mu_m)}$$

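We compute the gradient of $p'(x|m)$ with respect to the
independent variable $x$, while keeping the parameters $\mu_m$
and $\Sigma_m$ fixed:

$$\frac{\partial p'(x|m)}{\partial x} = w(x)\, \frac{\partial p(x|m)}{\partial x} + p(x|m)\, \frac{\partial w(x)}{\partial x} \qquad (2)$$

Considering each derivative:

$$\frac{\partial p(x|m)}{\partial x} = p(x|m)\, \Sigma_m^{-1} (\mu_m - x)$$

$$\frac{\partial w(x)}{\partial x} = w(x)\, \Sigma_s^{-1} (s - x)$$

and substituting these results in Equation (2), we obtain:

$$\frac{\partial p'(x|m)}{\partial x} = w(x)\, p(x|m)\, \Sigma_m^{-1} (\mu_m - x) + p(x|m)\, w(x)\, \Sigma_s^{-1} (s - x) \qquad (3)$$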
We now turn to the mixture of $M$ Gaussian distributions.
By the linearity property of the differential operator, we
obtain:

$$\frac{\partial p(x)}{\partial x} = w(x) \sum_{m=1}^{M} p(m)\, p(x|m)\, \Sigma_m^{-1} (\mu_m - x) + w(x) \sum_{m=1}^{M} p(m)\, p(x|m)\, \Sigma_s^{-1} (s - x)$$

By setting the above gradient to 0 and simplifying $w(x)$,
we derive a fixed point iteration procedure that finds the
modes of the distribution [35]:

$$\sum_{m=1}^{M} p(m)\, p(x|m)\, \Sigma_m^{-1} (\mu_m - x) = \sum_{m=1}^{M} p(m)\, p(x|m)\, \Sigma_s^{-1} (x - s)$$

Solving for $x$, we obtain:

$$x = \left( \sum_{m=1}^{M} p(m)\, p(x|m)\, \Sigma_s^{-1} + \sum_{m=1}^{M} p(m)\, p(x|m)\, \Sigma_m^{-1} \right)^{-1} \left( \sum_{m=1}^{M} p(m)\, p(x|m)\, \Sigma_s^{-1} s + \sum_{m=1}^{M} p(m)\, p(x|m)\, \Sigma_m^{-1} \mu_m \right)$$

Using the Bayes rule, $p(m|x) = p(m)\, p(x|m) / p(x)$, and simplifying $p(x)$,
we obtain our fixed-point solution:

$$x = \left( \sum_{m=1}^{M} p(m|x)\, \Sigma_s^{-1} + \sum_{m=1}^{M} p(m|x)\, \Sigma_m^{-1} \right)^{-1} \left( \sum_{m=1}^{M} p(m|x)\, \Sigma_s^{-1} s + \sum_{m=1}^{M} p(m|x)\, \Sigma_m^{-1} \mu_m \right) \qquad (4)$$

Comparing Equations (1) and (4), we can see that, by
weighting the points according to their distance from the
boundary, the modes of the resulting distribution become
the weighted average of the means $\mu_m$ and $s$. That is, each
local classifier, by assigning weights to points according to
the confidence of the prediction, causes the modes to shift
towards the points closest to the estimated boundary, i.e.
towards its margin.


4.2 Distribution of the Grid at time t = 1

The whole grid itself is modeled as a Gaussian mixture (given
by the collection of GMMs at each node). Thus, the same
derivation given above, applied to the grid, shows that the
overall data distribution will have the same modes emerging
from the individual nodes, i.e. centered around the margin
of the boundary.


4.3 Final Distribution of the Grid

After a number of iterations, at each node, data will be
sampled according to the current distribution. We can show
that all the nodes will converge to the same mode. Suppose
that a node $i$, at time $t$, has a neighborhood with means
$T(t) = \{\mu_1^{(t)}, \ldots, \mu_l^{(t)}\}$; one of these means, say $\mu_g^{(t)}$,
is the closest (globally) to the boundary. During successive
iterations, the sampling process causes the elimination of
modes that are far from the boundary. Thus, after $k > 0$ steps,
the local distribution of node $i$ will have a smaller number of
modes: $T(t+k) = \{\mu_1^{(t+k)}, \ldots, \mu_m^{(t+k)}\}$, with $l - m > 0$. Due
to the weighted sampling mechanism (note that the sample
size remains constant at each iteration), $\mu_g^{(t)} \in T(t+k)$.
The whole process converges when $T(t+1) = T(t)$,
or the mean shift is negligible, and at convergence
$T(t) = \{\mu_g^{(t)}\}$. We observe that spatially structured replication-based
evolutionary algorithms show a similar behavior, where the
global best is spread deterministically across the nodes, until
all the nodes in the grid converge to the same individual
according to logistic takeover curves [38].
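
The weighted fixed-point update of Equation (4) in Section 4.1 can be checked numerically with a minimal one-dimensional sketch ($D = 1$, so every covariance reduces to a scalar variance). The function names and the example parameters below are illustrative assumptions and are not taken from the paper's experiments.

```python
import math

def gauss(x, mean, var):
    """Density of a one-dimensional Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def weighted_mode(x0, props, means, variances, s, var_s, tol=1e-10, max_iter=1000):
    """Iterate the weighted fixed-point update in one dimension.

    props, means, variances describe the mixture; s and var_s describe the
    Gaussian weighting function centered on the boundary point s.
    """
    x = x0
    for _ in range(max_iter):
        joint = [p * gauss(x, mu, v) for p, mu, v in zip(props, means, variances)]
        total = sum(joint)                    # p(x); assumed non-zero near the data
        post = [j / total for j in joint]     # p(m | x) by the Bayes rule
        precision = sum(r / var_s + r / v for r, v in zip(post, variances))
        target = sum(r * s / var_s + r * mu / v
                     for r, mu, v in zip(post, means, variances))
        x_new = target / precision
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

For a single component with mean 0 and unit variance, and a weighting function centered at `s = 2.0` with unit variance, the iteration settles at 1.0, the precision-weighted average of the mean and `s`. With several components, the mode reached from a starting point near one mean is pulled from that mean towards `s`, which is the shift towards the margin discussed in the analysis.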


5 Empirical Analysis of PSBML and GMM with Mean-shift

We performed a number of experiments to verify the established
relationship between PSBML and GMMs with mean-shift.
We generated synthetic data on which we ran the
following experiments.

1) We ran the PSBML algorithm using a 5 x 5 spatial
grid with the C9 neighborhood (see Fig. 1) and a
large margin classifier, and observed the population
distribution change over the training epochs.

2) We replaced each local classifier with a GMM with 
mean-shift, while keeping the grid structure and 
neighborhood interaction unchanged. Each data 
instance is weighted a priori using the Gaussian 
weighting function as defined in the theoretical 
analysis. We ran GMM with mean-shift on each node 
and performed sampling iteratively at every training 
epoch exactly as in PSBML. We again observed the 
changes in the population distribution over time. 

3) We removed the grid and ran GMMs with mean- 
shift estimation on the whole dataset, with each 
instance weighted according to its distance from 
the known boundary as above. We observed the data 
distribution and final modes at convergence, and 
compared them with those obtained in the previous 
setting. 

5.1 A Non-linearly Separable Dataset 

Instances were drawn at random within a square centered at
the origin and with side of length two. Points with a distance
smaller than 0.4 from the origin are labeled as negative, and
those with a distance greater than or equal to 0.4 are labeled
as positive (see Figure 2). We ran the three experiments
described in Section 5 on this data. For experiment 1, the large
margin classifier used at each node fits a circle to its training
set by setting its radius to the average distance of the origin
from the smallest positive and the largest negative instances.
For testing, the learner outputs "-" when the instance falls
within the circle, and "+" otherwise. The confidence of the
prediction is the distance of the instance from the circular
boundary.

Fig. 2. Circle dataset.

Fig. 3. Circle dataset: Data distribution at epochs 25 (Top) and 50 (Bottom)
using PSBML and GMMs. Horizontal axis: circle radius at different levels.
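
The circle dataset and the simple circle learner of experiment 1 can be sketched as follows. This is a minimal illustration: the function names are hypothetical, labels are encoded as -1 and +1 for convenience, and the sketch assumes both classes are present in the data passed to `fit_radius`.

```python
import math
import random

def make_circle_data(n, rng):
    """Uniform points in the square [-1, 1]^2; label -1 inside radius 0.4, else +1."""
    data = []
    for _ in range(n):
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        label = -1 if math.hypot(x, y) < 0.4 else 1
        data.append(((x, y), label))
    return data

def fit_radius(data):
    """Average of the smallest positive and the largest negative distance from the origin."""
    pos = [math.hypot(*point) for point, label in data if label == 1]
    neg = [math.hypot(*point) for point, label in data if label == -1]
    return 0.5 * (min(pos) + max(neg))

def predict(radius, point):
    """Return (label, confidence); confidence is the distance to the circular boundary."""
    r = math.hypot(*point)
    return (-1 if r < radius else 1), abs(r - radius)
```

A node would call `fit_radius` on its local data and then `predict` on the neighborhood pool; instances whose confidence is small lie close to the fitted circle and therefore receive large sampling weights.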

To compare the data distributions obtained in experiments
1 and 2, we recorded the number of points at various
intervals of distances from the origin at training epochs 25
and 50. The resulting histograms are given in Figure 3. We
can clearly observe that the two methodologies, PSBML
and GMMs with mean-shift, provide a nearly identical
distribution at both epochs, and they converge to a
distribution with modes centered on the points closest to the
boundary.

For experiment 3, we ran GMMs with mean-shift estimation
30 times on the whole weighted data. The means of the
modes at convergence were (-0.01, 0.38) and (0.01, -0.41),
with a very small standard deviation of 0.03. The distribution
at convergence was very close to those obtained in experiments
1 and 2. Interestingly, we observed that, when the
weights were removed, the modes at convergence moved to
(-0.03, 0.51) and (0.03, -0.49).

5.2 Weight Distribution Changes 

One important property of boosting is to scale the weights
of data as a function of its distance from the margin. To
observe the effect of weight changes, in Figure 4 we plotted
the weights of all points at different radii and for different
generations for the circle dataset (Figure 2). We can clearly
see an exponential decay and a logistic increase based on
the vicinity to the margin of the data. For negative points
with a radius between 0.3 and 0.4, and for positive
points with a radius between 0.4 and 0.5, the weights increase
over time; for the rest there is an exponential decay,
confirming a behavior analogous to boosting.

Fig. 4. Changes in weight distribution as function of time: (Top) exponential
decay; (Bottom) logistic increase.


5.3 Linearly Separable Bivariate Gaussians 

We created a synthetic dataset consisting of 5 Gaussian distributions
for each class, with roughly the same density but
different shapes (see Figure 5). The Gaussian distributions
with means (14, 8) and (24, 8) are the closest to the boundary,
given by the line x = 20. They simulate the "global modes".
We again ran the three experiments described in Section 5.
The large margin classifier was simulated by estimating
the average distance between the smallest positive and the
largest negative instances.

Fig. 5. Bivariate Gaussian dataset.

Again we observed that the data distributions produced
by PSBML and GMM with mean-shift and grid structure are
very much alike, as illustrated in Figure 6. For experiment 3,
with 30 runs on the weighted dataset, the modes of the data
distribution converged to (14.02, 7.89) and (24.09, 7.88),
with deviation of 0.002, matching exactly our results for
experiments 1 and 2.



Fig. 6. Linearly separable Gaussian dataset: Data distribution at epochs
25 (Top) and 50 (Bottom) using PSBML and GMMs. Horizontal axis: X dimensions of data.



5.4 Hard Instances and Support Vectors 

We also analyzed the data distribution at convergence by
comparing the hard instances identified by PSBML with
the support vectors of a trained SVM. Table 1 shows the
percentage of overlap for the two simulated datasets. The
support vectors of the trained SVMs with the highest α (i.e.
weight) values correspond to the hard instances with the top
10% largest weights identified by the PSBML algorithm for
both the datasets.

TABLE 1
Overlap percentage between support vectors and PSBML hard instances.

              2D Circle    2D Gaussians
SV overlap    90%          94%
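
The overlap figure reported in Table 1 can be computed along the following lines. The paper does not state the exact denominator, so this sketch assumes the overlap is the share of the selected support vectors that also appear among the top-weighted PSBML instances; both helper names are hypothetical.

```python
def top_fraction(weights, fraction=0.10):
    """Ids of the instances with the largest weights (top `fraction` of all ids).

    weights maps an instance id to its PSBML weight.
    """
    k = max(1, int(round(fraction * len(weights))))
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:k]

def overlap_percentage(support_vector_ids, hard_instance_ids):
    """Percentage of support vectors that also appear among the hard instances."""
    sv = set(support_vector_ids)
    if not sv:
        raise ValueError("no support vectors given")
    return 100.0 * len(sv & set(hard_instance_ids)) / len(sv)
```

For example, `overlap_percentage(sv_ids, top_fraction(psbml_weights))` would compare a list of support vector ids against the 10% of instances that PSBML weighted most heavily.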


6 Experimental Results 

We ran all scalability experiments (where running times
were measured) on a dual 3.33 GHz 6-core Intel Xeon 5670
processor with no hyperthreading. This means that we had
a maximum of 12 hardware threads available. PSBML was
implemented both as a single-threaded Weka classifier
and as a multithreaded standalone Java program that could
run on any JVM version above 1.5 (see Section 8). All
experiments with PSBML were run using a maximum heap size of
8GB and a number of threads equal to the number of nodes
in the grid. All SVM and boosting implementations whose
running times were compared used either native Matlab
or C++ code, except for AdaBoostM1, for which Weka 3.7.1 was
used. All statistical significance tests were performed using
the Matlab paired t-test function.
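
For readers without Matlab, the statistic behind a paired t-test can be sketched in a few lines. This is an illustrative stand-in under stated assumptions, not the function used in the experiments: it returns only the t statistic and the degrees of freedom (the p-value would still have to be looked up from a t distribution), and the AUC values below are made up.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd_d == 0:
        raise ValueError("all paired differences are identical")
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Made-up AUC values for two methods over five paired runs.
auc_a = [0.91, 0.90, 0.92, 0.89, 0.93]
auc_b = [0.88, 0.89, 0.90, 0.87, 0.90]
t, dof = paired_t_statistic(auc_a, auc_b)
print(f"t = {t:.3f}, dof = {dof}")
```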

6.1 Parameter Sensitivity Analysis

To study the effect that the neighborhood structure of the
grid has on the performance of PSBML, we ran experiments
on the UCI Chess (King-Rook vs. King-Pawn) dataset,
which consists of 3196 instances, 36 attributes, and 2 classes.
PSBML was run on this problem using various neighborhood
structures, and the results are shown in Fig. 7. A 5 x 5 grid
was used with a Naive Bayes classifier with discretization
for numeric features. PSBML was evaluated by combining
the instances selected by all the nodes at each epoch; using
this collection of instances, we trained a single classifier
and tested its performance on the test set. Although the
average size reduction of the training dataset was quite
similar for all the neighborhoods, their classic "over-fitting
curves" were different (see Fig. 7). The notion of selection
pressure, controlled by the parameter Pr, gives the degree to
which only the highly weighted instances are selected at each
epoch. Since the sample selected at each node has a constant
size, the selection pressure is driven by the size of the pool
we choose the sample from. Furthermore, the more spread
out the neighborhood is, the faster the highly weighted instances
travel through the grid. As such, the neighborhoods L9 and
C13 have a stronger sampling pressure. They produced a
more rapid initial decrease in test classification error rates,
which subsequently increased more rapidly as the training
data became too sparse. The simplest L5 neighborhood
reduced classification error rates too slowly. The best results
were obtained with the neighborhood structure C9.

[Plot omitted; x-axis: Generations]
Fig. 7. Error rates at successive epochs for different neighborhood
structures.

Fig. 8. Error rates at successive epochs for different Pr values.
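
To make the link between neighborhood shape and pool size concrete, the sketch below enumerates the nodes covered by each neighborhood on a wraparound grid. The offset lists for L5, L9, C9, and C13 are an assumption based on the usual definitions in the cellular-algorithm literature (linear and compact shapes), and the idea that a node keeps a constant-size sample from a pool formed by itself and its neighbors is a simplification; neither is taken from the PSBML code.

```python
# Offsets (row, col) relative to the centre node. These follow the common
# cellular-algorithm conventions and are an assumption, not the paper's code.
L5 = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
L9 = L5 + [(-2, 0), (2, 0), (0, -2), (0, 2)]
C9 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
C13 = C9 + [(-2, 0), (2, 0), (0, -2), (0, 2)]


def neighbors(row, col, offsets, rows, cols):
    """Distinct grid cells in the neighbourhood of (row, col) on a wraparound grid."""
    return sorted({((row + dr) % rows, (col + dc) % cols) for dr, dc in offsets})


def kept_fraction(offsets, rows, cols):
    """Fraction of the pooled instances a node keeps if its sample size is constant
    and every node in the neighbourhood contributes equally many instances."""
    return 1.0 / len(neighbors(0, 0, offsets, rows, cols))


for name, offs in [("L5", L5), ("L9", L9), ("C9", C9), ("C13", C13)]:
    pool_nodes = len(neighbors(0, 0, offs, 5, 5))
    print(name, pool_nodes, round(kept_fraction(offs, 5, 5), 3))
```

Under these assumptions a larger neighborhood means a smaller kept fraction, which is one way to read the stronger selection pressure of the larger neighborhoods discussed above.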

We also used the UCI Chess dataset to investigate how
the rate of replacement Pr affects the performance of PSBML.
Figures 8 and 9 illustrate that increasing the value of Pr
results in faster convergence rates, but also in less accurate
models. The best results were obtained when Pr = 0.2,
which is the value we use in our experiments.

Finally, to investigate the impact of the grid size on
accuracy, we used the Chess and Magik datasets, both
with different training data sizes. The Magik dataset has
17,116 instances, 10 attributes, and 2 classes. We used the
C9 neighborhood configuration and fixed the value of the
replacement rate Pr to 0.2. We tested various grid sizes
ranging from 3 x 3 to 7 x 7. We measured the AUC for
PSBML over 30 runs. The Naive Bayes classifier (with the
same configuration) was used at each node of the grid. Table 2
summarizes the results, showing that there is no statistically
significant difference in AUC values across the various grid
sizes.

[Plot omitted; x-axis: Generations]
Fig. 9. Number of distinct instances sampled at successive epochs for
different Pr values.

These results are not surprising. Given the wraparound
nature of the grid, and the diffusion of hard instances
through the weighted sampling process, only the rate of
convergence to the margin is affected by the grid size. For
the Chess dataset, which has only 3,196 instances, we observe
a slight degradation in performance as the number of nodes
increases. This is because the training data available at
each node shrinks significantly, and as a result the classifier's
VC bound comes into effect. Thus, for smaller datasets,
the choice of the grid configuration may depend on this lower
bound. With a larger dataset like Magik (17,116 instances),
no degradation is observed. This is an important insight for
the practitioner dealing with massive data, as the grid size
can be configured based on the number of hardware cores
available.
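
A rough calculation shows how quickly the per-node share of data shrinks with the grid size. This is a minimal sketch, assuming that the instances are spread evenly over the nodes and using the full dataset sizes quoted in the text (the training folds actually used would be somewhat smaller); the helper name is hypothetical.

```python
def instances_per_node(n_instances, grid_side):
    """Average number of instances per node on a square grid_side x grid_side grid."""
    return n_instances / (grid_side * grid_side)


for name, n in [("Chess", 3196), ("Magik", 17116)]:
    for side in range(3, 8):
        print(f"{name} {side}x{side}: {instances_per_node(n, side):.1f}")
```

With a 7 x 7 grid, each node sees on the order of tens of Chess instances but still several hundred Magik instances, which is consistent with degradation appearing only on the smaller dataset.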


TABLE 2

AUC results for PSBML with different grid sizes.

Datasets    3x3     4x4     5x5     6x6     7x7
Chess       98.5    98.5    98.3    98.2    98.1
Magik       89.4    89.5    89.4    89.4    89.5


6.2 Meta-learning Experiments 

The goal of this experiment is two-fold: first, to validate that
PSBML provides a general framework for meta-learning,
and therefore can be used in combination with a variety
of learners; second, to verify that it is an effective parallel
algorithm, i.e., it provides accuracy results comparable
to the sequential counterpart, while achieving a speedup.
To illustrate this, we performed experiments using three
base classifiers: Naive Bayes, Decision Trees (C4.5), and
Linear SVMs (LibLinear v1.8); the corresponding Weka
implementations were used. We used five medium to
large UCI datasets [40], commonly used for performance
comparisons. Table 3 provides a description of the data.
For each dataset, we normalized the features in the range
[0,1], and converted multi-class problems to binary, using
the one-vs-all strategy optimized for the LibSVM system, as
described in [1]. The PSBML algorithm was run with the
C9 neighborhood, a 3 x 3 grid, a replacement probability
of 0.2, 20 training epochs, and a validation set size of 10%.
We first optimized the base classifiers for performance, and
then used the optimized settings in PSBML. Naive Bayes was
used with the option of kernel estimation instead of using the
default normal estimation; C4.5 was used with the default
settings; and LibLinear was used with the L2 loss function
in both experiments. Each run, with the exception of Cover
and C4.5, was repeated 30 times, and paired t-tests were
used for statistical significance computation using the Area
Under the Curve (AUC) as the metric. The experiments
involving Cover and C4.5 were run only 10 times, due to
the long processing time. Hence significance is not recorded
in this case. Results are reported in Table 4. All statistically
significant results are marked in bold-face.

TABLE 3

UCI datasets used in the experiments

              Adult    W8A      ICJNN1   Cod      Cover
# Train       32560    49749    49990    331617   581012
# Test        16279    14951    91701    59535    58102
# Features    123      300      22       8        54
# Labels      2        2        2        2        7

TABLE 4

Meta-learning results (AUC) comparing the base classifiers and PSBML
combined with the same.

              Adult    W8A      ICJNN1   Cod      Cover
NB            90.1     94.30    81.60    87.20    84.90
PSBML         90.69    96.10    81.79    91.79    87.31
C4.5          88.01    87.80    94.60    95.90    99.50
PSBML         88.78    84.80    97.30    97.24    97.44
Linear SVM    54.60    80.20    64.60    88.80    72.20
PSBML         60.01    80.70    64.80    95.10    79.10
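
The two preprocessing steps described in this section, scaling each feature to [0,1] and turning a multi-class problem into a binary one, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the choice of positive class is left to the caller because the text does not state it, and in practice the scaling ranges would normally be estimated on the training split and reused on the test split.

```python
def min_max_scale(rows):
    """Scale each feature column of a list of rows to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant columns map to 0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled


def one_vs_all(labels, positive_class):
    """Binarise multi-class labels: 1 for the chosen class, 0 for all others."""
    return [1 if y == positive_class else 0 for y in labels]


# Toy data: the second feature is constant.
X = [[2.0, 10.0], [4.0, 10.0], [6.0, 10.0]]
y = ["a", "b", "c"]
print(min_max_scale(X))
print(one_vs_all(y, "b"))
```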

We observe that PSBML, combined with the Naive Bayes
classifier, performs statistically significantly better than the
Naive Bayes classifier itself on all the datasets. Similar results
were observed, and theoretical insights were provided, with
regular boosting and Naive Bayes [43]. Another important
result to note is that the ensemble effect of PSBML makes the
accuracy of a linear SVM significantly better (in three cases),
while parallelizing the LibLinear SVM, which was already
optimized for speed.


6.3 Scalability Experiments

The goal of this experiment is to validate whether PSBML
performs competitively against custom optimized learning
algorithms, in terms of training time, as a measure of speed,
and in terms of accuracy, as a measure of performance.
PSBML shares an important feature with SVMs: it reduces
the training data to the points which are close to the
boundary. Thus, we compared PSBML with a number of
SVM implementations: a fast Newton-based method, LP-SVM;
a structural optimization-based technique, SVM-PERF
(linear only, because it crashed with an RBF kernel); the
most commonly used LibSVM; the fast optimized LibLinear;
a stochastic gradient-based approximation method,
SGDT [13]; and the fast ball enclosure-based BVM. We also
compared PSBML against a parallel AdaBoost algorithm [44]
and the standard AdaBoostM1. All of the above-mentioned
implementations of SVMs incorporate some form of custom
changes to boost the speed, such as incremental sampling of
the dataset, simplifying the quadratic optimization, or
assuming linearly separable data.

The first dataset used for this experiment was a two-dimensional
decision boundary based on a sine wave generated
by the function $f(x) = 2\sin(2\pi x_1)$ (see Figure 10).
The dimension $x_1$ was sampled from the interval [0, 6.28]
and the $y = f(x)$ dimension was randomly sampled from
the interval [0, 2]. The second dataset is a 4 x 4 rotated
checkerboard dataset with alternating positive and negative
classes, as shown in Figure 10. Each dataset has one million
instances, and all the experiments were repeated 30 times.
We measured training time for each of the runs, and the
average training time is reported. 10-fold cross-validation was
performed for accuracy and the average accuracy is reported.
Each algorithm was tuned to some level of optimality for
comparisons, i.e., the soft margin parameter and the radius of
the RBF kernel for SVMs were optimized using a grid search
in the intervals [-5, 15] and [3, -15], respectively.

The PSBML algorithm was run with the C9 neighborhood,
a 3 x 3 grid, a replacement probability of 0.2, 10 training
epochs, and a validation set size of 10% for each training
fold. The C4.5 classifier with default parameters was used
as it had an intermediate training speed between the fast
LibLinear and the kernel-estimated Naive Bayes. Results are
shown in Table 5. For both the synthetic datasets, PSBML
gives the most accurate results with respect to the methods
that have comparable training speed (i.e., LibLinear and
LibSVM). Most of the techniques customized for high speed
give poor accuracy results. The synthetic datasets, being
highly non-linear, exaggerate the trade-offs implemented by
the algorithms.

Fig. 10. Synthetic datasets: (Left) Sine wave; (Right) Checkerboard.

6.3.1 Real-world Dataset 

The KDD Cup 1999 intrusion detection dataset was used to
compare the performance of the algorithms. The dataset
contains 4,898,431 training instances. The problem was
converted into a binary classification problem because many
SVM implementations did not support multi-class labels.
The feature set was also scaled within the range [0,1], which
improved the performance of many SVMs by almost a factor of 10.
The PSBML algorithm was run with the C9 neighborhood, a
3 x 3 grid, a replacement probability of 0.2, 10 training epochs,
and a validation size of 0.1% of the training data. C4.5 was
again used with default parameters, for the same reasons
mentioned earlier.

TABLE 5

Training speed (in seconds) and accuracy for the Checkerboard and the
Sine Wave datasets.

                              Checkerboard         Sine Wave
Algorithm                     Speed      Acc       Speed      Acc
SVM
LP-SVM (Linear)               44.20      50.23     33.20      68.80
LP-SVM (RBF)                  33.20      57.11     105.56     70.11
LibLinear                     133.20     50.08     203.12     68.60
SGDT (10 iterations)          4.20       54.49     4.20       54.89
SVM-PERF (Linear)             1.10       51.01     2.01       61.90
BVM (RBF)                     1.80       50.03     1.20       49.03
LibSVM (RBF, 0.1%)            136.20     98.20     423.23     70.80
Boosting
AdaBoostM1                    38.21      51.25     30.71      74.25
ParallelAdaBoost              17.90      51.22     13.90      78.30
(9 threads, 10 iterations)
PSBML
PSBML (C4.5)                  123.10     99.49     193.10     99.56

TABLE 6

Training speed (in seconds), mis-classification, and area under ROC and
PRC for the KDD Cup 1999 dataset.

Algorithm                     Speed        MisClass     ROC     PRC
SVM
LibLinear                     80.20        25,447.3     94.4    6.3
LibSVM (RBF, 1%)              90.20        25,517.8     94.1    76.9
LibSVM (RBF, 10%)             1,495.20     25,366.1     94.1    13.1
SGDT (10 iterations)          211.10       121,301      -       -
SVM-PERF (Linear)             4.90         25,877.1     93.1    90.3
BVM (RBF)                     3.20         25,451.3     -       -
Boosting
AdaBoostM1                    13,296.42    190,103.3    88.4    17.2
ParallelAdaBoost              202.30       26,170.2     36.2    70.2
(9 threads, 10 iterations)
PSBML
PSBML (C4.5)                  2,913.10     20,898.8     95.6    91.2

In previous work, it was noted that many algorithms
have a very similar error rate on this dataset. Hence, the
number of mis-classifications was suggested and used as the
comparison metric. We do the same here. In addition,
we measure the areas under the ROC and under the Precision
Recall Curve (PRC), since the dataset is unbalanced. Each of
the experiments was run 30 times, except for AdaBoostM1
(only 10 times) due to its long training time. The mean training
times and the mean mis-classification averages are reported
in Table 6. Some of the algorithms, e.g., LP-SVM, could not run
on a machine with 12GB of RAM, because the loading of the data
matrix itself failed. Also, for SGDT and BVM we could not
compute the output probabilities to measure ROC and PRC
due to the kernel choice. We observe that most algorithms
that were optimized for speed had to trade off accuracy.


[Plot omitted; title: Average Training Times per Epoch (Log Scale); legend: Linear Model Fit]
Fig. 11. Mean training times per epoch with varying dataset sizes.

Fig. 12. Mean training times per epoch with varying threads.


Also, the training time of LibSVM increased considerably
when the sampled data went from 1% to 10%, with only a small
change in classification rate. The ROC value for PSBML
was statistically significantly better; the value of the PRC
area was comparable to that of SVM-PERF. In conclusion,
PSBML, while working on the entire dataset, achieves a good
classification rate at a considerable speed.
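
Because the dataset is unbalanced, the area under the ROC curve is reported alongside the raw mis-classification counts. As a reminder of what that number measures, the sketch below computes it through its rank interpretation: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. It is a didactic version with made-up scores, quadratic in the number of instances, and is not the evaluation code used in the experiments.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve as P(positive score > negative score); ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up classifier scores; one positive is ranked below one negative.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(roc_auc(scores, labels))
```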

To see the impact of data sizes on PSBML, we also selected
training samples of various sizes: 50K, 100K, 500K, and
one million. Ten runs were performed with standard PSBML
with decision trees, a 3 x 3 grid, and the C9 neighborhood.
Nine threads were used in this experiment. Training time
(log scale) is plotted against data size in Figure 11. The
graph clearly shows a steady linear scaling with data size.
To see the impact of the multi-core processor described
above on scalability, we changed the number of threads and
computed the corresponding average training times. The
result is given in Figure 12, which again shows a consistent
linear improvement with the number of threads.

Another important aspect of a large-scale learning
algorithm is its memory requirements. To evaluate this impact,
we measured the memory usage with varying data sizes.
We used the same data sizes and configuration as in the
previous experiments. Figure 13 shows the mean peak
working memory during training as a function of different
training data sizes. Again, this result shows a linear increase
with the training data size, thus providing empirical evidence
that the memory space complexity of PSBML is $O(n)$, where
n is the size of the training set. In comparison, SVMs are
$O(n^2)$ [1]. As such, PSBML also has a key advantage in terms
of memory requirements.

Fig. 13. Mean peak working set memory with varying dataset sizes.

6.4 Comparison against AdaBoost and Impact of Noise 

Here we compare PSBML against AdaBoost and test its
robustness in the presence of noise. Previous work found that
boosting is more susceptible to noise than other
ensemble methods like bagging and stacking [46].
We added noise to the class labels by randomly changing
different percentages of labels. We used AdaBoostM1 both
with decision stumps and with Naive Bayes (optimized
using kernel estimators), and compared it against PSBML
combined with the same underlying Naive Bayes classifier.
PSBML was used with the default C9 neighborhood, a
replacement probability of 0.2, and a validation set of 10%.
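
Label noise of this kind can be simulated by flipping a fixed fraction of the labels. The following is a minimal sketch, assuming binary 0/1 labels (as obtained after the one-vs-all conversion) and a uniformly random choice of which instances to corrupt; the function name, the seed, and the toy labels are hypothetical, and the exact procedure used in the experiments may differ.

```python
import random


def flip_labels(labels, noise_fraction, seed=0):
    """Return a copy of binary 0/1 labels with a given fraction flipped at random."""
    rng = random.Random(seed)
    n_flip = int(round(noise_fraction * len(labels)))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


clean = [0, 1] * 50                      # 100 toy labels
noisy = flip_labels(clean, 0.10)
changed = sum(a != b for a, b in zip(clean, noisy))
print(changed)                           # number of corrupted labels
```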

We used the same datasets used for the meta-learning
experiments, and did the same preprocessing. We performed
30 runs to compare the three algorithms without noise, and
in the presence of 10% and 20% of noise. The results are shown
in Table 7. Statistically significant results are highlighted in
boldface.

In the absence of noise, PSBML with Naive Bayes performs
significantly better than AdaBoostM1 with decision
stumps or with the same optimized Naive Bayes in three
of the five datasets. To measure how robust a method is
across all the datasets, we compute the following quantity:

$$\text{impact} = \frac{1}{N} \sum_{i=1}^{N} \left( \mathrm{AUC}^{(i)}_{\text{no-noise}} - \mathrm{AUC}^{(i)}_{\text{noise}} \right),$$

where N is the number of datasets. The smaller the value of the
impact is for an algorithm, the more robust that method is on average.

The impact values of AdaBoostM1 (DecisionStump),
AdaBoostM1 (NaiveBayes), and PSBML (NaiveBayes) with
10% noise are 4.41, 3.32, and 1.71, respectively. Similarly,
with 20% noise the impact values for these algorithms
are 5.02, 4.62, and 2.02, respectively. This shows that the
PSBML algorithm is more robust to noise than
standard boosting. This is likely due to two reasons. First,
in PSBML, the weighted sampling procedure is driven by
the confidence of predictions only (prediction errors are not
used), while AdaBoost assigns larger weights to instances
which are erroneously predicted. Second, PSBML makes use
of a validation set to estimate the best classifier to be used
for prediction of test instances, thus preventing overfitting.


TABLE 7

Performance of AdaBoostM1 (DS: Decision Stump), AdaBoostM1 (NB:
Naive Bayes) and PSBML (NB: Naive Bayes) with no, 10%, and 20%
noise.

                  Adult    W8A      ICJNN1   Cod      Cover
No Noise
AdaBoostM1/DS     87.10    77.80    93.40    92.80    75.70
AdaBoostM1/NB     87.20    93.30    84.30    95.70    85.30
PSBML/NB          90.69    96.10    81.79    91.79    87.31
10% Noise
AdaBoostM1/DS     85.70    58.90    92.82    92.20    75.10
AdaBoostM1/NB     85.80    83.40    79.80    95.10    85.10
PSBML/NB          90.46    96.01    77.46    88.06    87.14
20% Noise
AdaBoostM1/DS     85.10    57.10    92.30    92.10    75.10
AdaBoostM1/NB     84.88    79.01    79.70    94.90    84.20
PSBML/NB          90.10    95.97    77.42    86.98    87.11
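
The impact values quoted in the text can be checked directly against Table 7. The sketch below assumes that impact is the mean, over the five datasets, of the no-noise score minus the noisy score; under that reading the computed values agree with the reported ones up to rounding in the second decimal. The dictionary layout and names are illustrative only.

```python
def impact(no_noise, noise):
    """Mean drop in performance across datasets when label noise is added."""
    drops = [clean - noisy for clean, noisy in zip(no_noise, noise)]
    return sum(drops) / len(drops)


# Values copied from Table 7, in the order Adult, W8A, ICJNN1, Cod, Cover.
table7 = {
    "AdaBoostM1/DS": {"none": [87.10, 77.80, 93.40, 92.80, 75.70],
                      "10%": [85.70, 58.90, 92.82, 92.20, 75.10],
                      "20%": [85.10, 57.10, 92.30, 92.10, 75.10]},
    "AdaBoostM1/NB": {"none": [87.20, 93.30, 84.30, 95.70, 85.30],
                      "10%": [85.80, 83.40, 79.80, 95.10, 85.10],
                      "20%": [84.88, 79.01, 79.70, 94.90, 84.20]},
    "PSBML/NB":      {"none": [90.69, 96.10, 81.79, 91.79, 87.31],
                      "10%": [90.46, 96.01, 77.46, 88.06, 87.14],
                      "20%": [90.10, 95.97, 77.42, 86.98, 87.11]},
}

for method, rows in table7.items():
    print(method,
          f"{impact(rows['none'], rows['10%']):.3f}",
          f"{impact(rows['none'], rows['20%']):.3f}")
```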


7 Conclusion 

The PSBML algorithm provides a general framework for
parallelizing machine learning algorithms. The key contributions
of this paper are: (1) Establishing a theoretical statistical
model for PSBML, (2) Proving that PSBML is a large margin
classifier, and (3) Providing a comprehensive experimental
study. Our empirical analysis confirmed the veracity of the
theoretical model.

The meta-learning experiments have shown that PSBML
exhibits characteristics similar to those of AdaBoost in the
sense that adding ensemble boosting to a standard classifier
produces at least comparable and often better results. Scalability
experiments confirm that, while maintaining good running
times for training, the accuracy is not compromised. We have
also shown a steady linear improvement in speed with an
increasing number of threads, as well as linear training time
and linear memory use as a function of data size. In addition,
the spatial structure aspects of PSBML provide a resilience
to noise, an important feature for real-world applications.

There are several immediate extensions to this work. We
are now adapting the algorithm to semi-supervised learning
and unsupervised learning. In addition, we are exploring the
possibility of mapping PSBML onto distributed architectures
such as Beowulf-style clusters in combination with map-reduce
algorithms.

8 Software and Data 

Software, data, and parameters used to perform
the experiments in this paper are available at
https://sites.google.com/site/psbml2013/ under an
academic license.